In this practical, we will show an example of loading pre-trained word vectors and fine-tuning them for sentiment classification of movie reviews. We use the following packages:
library(text2vec)
library(tidyverse)
library(tidytext)
Besides these packages, we need to install the TensorFlow and Keras packages for R.
The TensorFlow package provides code completion and inline help for the TensorFlow API when running within the RStudio IDE. The TensorFlow API is composed of a set of Python modules that enable constructing and executing TensorFlow graphs.
Install the TensorFlow R package from GitHub as follows:
# devtools::install_github("rstudio/tensorflow")
Then, use the install_tensorflow() function to install TensorFlow:
library(tensorflow)
# install_tensorflow(package_url = "https://pypi.python.org/packages/b8/d6/af3d52dd52150ec4a6ceb7788bfeb2f62ecb6aa2d1172211c4db39b349a2/tensorflow-1.3.0rc0-cp27-cp27mu-manylinux1_x86_64.whl#md5=1cf77a2360ae2e38dd3578618eacc03b")
Note that the URL above pins a specific (and by now outdated) TensorFlow build; you can also call install_tensorflow() without any arguments to install the latest version.
Finally, you can confirm that the installation succeeded with:
tmr <- tf$constant("Text Mining with R!")
print(tmr)
## tf.Tensor(b'Text Mining with R!', shape=(), dtype=string)
This will provide you with a default installation of TensorFlow suitable for getting started with the TensorFlow R package. See the article on installation (https://tensorflow.rstudio.com/installation/) to learn about more advanced options, including installing a version of TensorFlow that takes advantage of Nvidia GPUs if you have the correct CUDA libraries installed.
To install the Keras package, first run either of the following lines:
# install.packages("keras")
# devtools::install_github("rstudio/keras")
Then, use the install_keras() function to install Keras. The Keras R interface uses the TensorFlow backend engine by default. This will provide you with default CPU-based installations of Keras and TensorFlow. If you want a more customized installation, e.g. if you want to take advantage of NVIDIA GPUs, see the documentation for install_keras() and the article on installation (https://tensorflow.rstudio.com/installation/).
The ISLR authors also prepared an installation guide to Python, Reticulate and Keras: https://web.stanford.edu/~hastie/ISLR2/keras-instructions.html
Now we have TensorFlow and Keras ready for fine-tuning pre-trained word embeddings for sentiment classification on movie reviews.
Remember to load the Keras library:
library(keras)
##
## Attaching package: 'keras'
## The following objects are masked from 'package:text2vec':
##
## fit, normalize
For sentiment classification with pre-trained word vectors, we use the GloVe pre-trained word vectors. These word vectors were trained on Wikipedia 2014 and Gigaword 5 (6B tokens, 400K uncased vocabulary) and are available in 50d, 100d, 200d, and 300d versions. Download the glove.6B.300d.txt file manually from the website or use the code below for this purpose.
# Download Glove vectors if necessary
# if (!file.exists('glove.6B.zip')) {
# download.file('https://nlp.stanford.edu/data/glove.6B.zip',destfile = 'glove.6B.zip')
# unzip('glove.6B.zip')
# }
# load glove vectors
vectors <- data.table::fread('data/glove.6B.300d.txt', data.table = F, encoding = 'UTF-8')
colnames(vectors) <- c('word', paste('dim',1:300,sep = '_'))
# convert vectors to dataframe
vectors <- as_tibble(vectors)
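As a quick sanity check on the loaded vectors, we can compute cosine similarities between a few word pairs; related words should score noticeably higher than unrelated ones. This is an optional sketch that assumes the vectors tibble created above (the cosine_sim helper is our own, not part of any package):

```r
# cosine similarity between the GloVe vectors of two words
cosine_sim <- function(w1, w2, vecs) {
  v1 <- as.numeric(vecs[vecs$word == w1, -1])
  v2 <- as.numeric(vecs[vecs$word == w2, -1])
  sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
}
cosine_sim("good", "great", vectors)    # related pair: relatively high
cosine_sim("good", "keyboard", vectors) # unrelated pair: lower
```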
We use the movie_review data set from the text2vec package. This data set consists of 5000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of the reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 in a sentiment score of 1. No individual movie has more than 30 reviews. Load this data set and convert it to a dataframe.
# load an example dataset from text2vec
data("movie_review")
as_tibble(movie_review)
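Before modelling, it is worth checking how balanced the outcome variable is; if the classes were heavily imbalanced, plain accuracy would be misleading. A quick check using the packages loaded above:

```r
# inspect the distribution of the binary sentiment label
movie_review %>%
  count(sentiment) %>%
  mutate(prop = n / sum(n))
```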
Before training our model in Keras, let's first define the hyperparameters. Define the parameters of your Keras model with a maximum of 10,000 words, a maxlen of 60, and a word embedding size of 300 (if you run into memory problems, change the embedding dimension to a smaller value, e.g., 50).
max_words <- 1e4
maxlen <- 60
dim_size <- 300
Use the text_tokenizer function from Keras and tokenize the IMDB review data using a maximum of 10,000 words.
# tokenize the input data and then fit the created object
word_seqs <- text_tokenizer(num_words = max_words) %>%
fit_text_tokenizer(movie_review$review)
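Optionally, inspect the fitted tokenizer: its word_index maps every word seen in the reviews to an integer rank (1 = most frequent), and only the top max_words of these are kept when texts are converted to sequences.

```r
# total number of distinct words seen while fitting the tokenizer
length(word_seqs$word_index)
# the most frequent words receive the smallest indices
head(word_seqs$word_index)
```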
Use the pad_sequences function to pad the sequences to the maxlen defined above.
# apply tokenizer to the text and get indices instead of words
# later pad the sequence
x_train <- texts_to_sequences(word_seqs, movie_review$review) %>%
pad_sequences(maxlen = maxlen)
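A quick check that padding worked as intended: x_train should be a matrix with one row per review and maxlen columns, and reviews shorter than maxlen are padded with zeros (at the start, by default).

```r
# expected: 5000 rows (reviews) and 60 columns (maxlen)
dim(x_train)
# leading zeros indicate padding for a short review
x_train[1, ]
```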
# unlist word indices
word_indices <- unlist(word_seqs$word_index)
# then place them into data.frame
dic <- data.frame(word = names(word_indices), key = word_indices, stringsAsFactors = FALSE) %>%
arrange(key) %>% .[1:max_words,]
# join the words with GloVe vectors and
# if a word does not exist in GloVe, then fill NA's with 0
word_embeds <- dic %>% left_join(vectors, by = "word") %>% .[,3:302] %>% replace(., is.na(.), 0) %>% as.matrix()
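Words from the tokenizer that do not occur in GloVe end up as all-zero rows in word_embeds. It can be informative to check what fraction of the vocabulary was actually matched; this sketch assumes the word_embeds matrix built above.

```r
# proportion of the 10,000 tokenizer words found in GloVe
# (unmatched words have all-zero embedding rows)
mean(rowSums(abs(word_embeds)) > 0)
```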
Use the sentiment column in the original dataframe as the outcome and name it y_train.
# the outcome variable
y_train <- as.matrix(movie_review$sentiment)
Use the Keras functional API to create a neural network model as below. Can you describe this model?
# Use Keras Functional API
input <- layer_input(shape = list(maxlen), name = "input")
model <- input %>%
layer_embedding(input_dim = max_words, output_dim = dim_size, input_length = maxlen,
# put weights into list and do not allow training
weights = list(word_embeds), trainable = FALSE) %>%
layer_spatial_dropout_1d(rate = 0.2) %>%
bidirectional(
layer_gru(units = 80, return_sequences = TRUE)
)
max_pool <- model %>% layer_global_max_pooling_1d()
ave_pool <- model %>% layer_global_average_pooling_1d()
output <- layer_concatenate(list(ave_pool, max_pool)) %>%
layer_dense(units = 1, activation = "sigmoid")
model <- keras_model(input, output)
# model summary
summary(model)
# instead of accuracy we can use "AUC" metrics from "tensorflow.keras"
model %>% compile(
optimizer = "adam", # optimizer = optimizer_rmsprop(),
loss = "binary_crossentropy",
metrics = tensorflow::tf$keras$metrics$AUC() # metrics = c('accuracy')
)
history <- model %>% keras::fit(
x_train, y_train,
epochs = 10,
batch_size = 32,
validation_split = 0.2
)
plot(history)
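Once trained, the model can score new, unseen reviews by passing them through the same tokenizer and padding pipeline used for the training data. A minimal sketch (the example sentences below are made up for illustration):

```r
# score two hand-written reviews with the trained model
new_texts <- c("A wonderful, touching film with great acting.",
               "Dull, predictable, and far too long.")
new_seqs <- texts_to_sequences(word_seqs, new_texts) %>%
  pad_sequences(maxlen = maxlen)
# predicted probabilities of positive sentiment
predict(model, new_seqs)
```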